# YouTube API Replication Package

This repository provides a complete replication package for collecting and analyzing YouTube data using geospatial-temporal queries. The package includes data collection scripts, text analysis tools, and spatial visualization components for studying location-based YouTube content and comments.

## Table of Contents

- [Overview](#overview)
- [Requirements](#requirements)
- [Installation](#installation)
- [Configuration](#configuration)
- [Data Collection Pipeline](#data-collection-pipeline)
- [Text Analysis](#text-analysis)
- [Spatial Analysis](#spatial-analysis)
- [Correlation Analysis](#correlation-analysis)
- [Directory Structure](#directory-structure)
- [Replication Guide](#replication-guide)
- [Data Management](#data-management)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)
- [Contact](#contact)

## Overview

This replication package enables researchers to:
- Collect YouTube video metadata and comments based on geographic locations and time periods
- Perform sentiment and content analysis on collected comments using WordNet-based dictionaries
- Generate spatial visualizations and perform geospatial interpolation analysis
- Conduct country-level comparative analysis using socioeconomic indicators

The package is designed with complete reproducibility in mind, featuring automated environment setup, comprehensive logging, and modular components.

## Requirements

### System Requirements
- Linux operating system (tested on Ubuntu/Debian)
- Conda package manager
- Python 3.12+
- R 4.0+
- Stata (only for the correlation analysis do-files)
- Minimum 8GB RAM
- At least 50GB free disk space for data storage

### API Access
- YouTube Data API v3 key (required)
- World Bank API access (for comparative analysis)

## Installation

### 1. Clone the Repository
```bash
git clone <repository-url>
cd youtube-api-replication-package
```

### 2. Create Conda Environment
```bash
# Create environment from specification file
conda env create -f environment.yml

# Activate the environment
conda activate youtube-api-replication-package
```

### 3. Verify Installation
```bash
# Test Python dependencies
python -c "import requests, pandas, geopandas, nltk; print('Python dependencies OK')"

# Test R dependencies (if using R components)
Rscript -e "library(sf); library(ggplot2); cat('R dependencies OK\n')"
```

## Configuration

### 1. API Key Setup
Create a `.env` file in the `code/python/` directory:
```bash
cp code/python/.env.example code/python/.env
```

Edit `.env` and add your YouTube API key:
```
API_KEY=your_youtube_api_key_here
```
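
How the scripts read this file is not documented here. As a minimal sketch, assuming plain `KEY=VALUE` lines, the key could be loaded and sanity-checked with the standard library alone. The helper names `load_env_file` and `get_api_key` are hypothetical and not part of the package:

```python
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(env_path):
    """Return API_KEY, failing loudly if it is missing or still the placeholder."""
    key = load_env_file(env_path).get("API_KEY", "")
    if not key or key == "your_youtube_api_key_here":
        raise RuntimeError(f"Set API_KEY in {env_path} before running the pipeline")
    return key
```

Calling `get_api_key("code/python/.env")` before a long collection run catches a forgotten placeholder early instead of at the first API request.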

### 2. Project Configuration
Edit `config/config.yml` to specify your target country and time period:
```yaml
country_code: mx                           # ISO 2-letter country code
country_name: mexico                       # Country name (lowercase)
start_datetime: '2021-04-01T00:00:00Z'    # Start date (ISO format)
end_datetime: '2022-03-01T23:59:59Z'      # End date (ISO format)
espg: 4326                                 # EPSG code of the coordinate reference system (4326 = WGS 84)
pop_threshold: 100                         # Minimum population threshold
radius: 7070m                              # Search radius for geographic queries
```
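
Before starting a long collection run, it can help to check the configuration values. The sketch below is illustrative only: it takes the settings as an already-loaded dictionary, `validate_config` is a hypothetical helper, and it assumes the radius is always given in metres as in the example above.

```python
from datetime import datetime


def parse_utc(timestamp):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def validate_config(cfg):
    """Check the fields shown above; return a list of problems (empty if none)."""
    problems = []
    if len(str(cfg.get("country_code", ""))) != 2:
        problems.append("country_code should be a 2-letter code")
    try:
        if parse_utc(cfg["start_datetime"]) >= parse_utc(cfg["end_datetime"]):
            problems.append("start_datetime must be before end_datetime")
    except (KeyError, ValueError):
        problems.append("start_datetime/end_datetime missing or not ISO 8601")
    radius = str(cfg.get("radius", ""))
    # Assumption: radius is a whole number of metres followed by 'm'.
    if not (radius.endswith("m") and radius[:-1].isdigit()):
        problems.append("radius should look like '7070m'")
    return problems


example = {
    "country_code": "mx",
    "country_name": "mexico",
    "start_datetime": "2021-04-01T00:00:00Z",
    "end_datetime": "2022-03-01T23:59:59Z",
    "espg": 4326,
    "pop_threshold": 100,
    "radius": "7070m",
}
print(validate_config(example))  # []
```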

## Data Collection Pipeline

The data collection follows a sequential 5-step process:

### Step 1: Generate Location Grid (`00_cn_locations.py`)
```bash
cd code/python/youtube-api-scrape
python 00_cn_locations.py
```
- Reads population raster data and shapefiles
- Creates a grid of locations with population > threshold
- Outputs: `{country_code}-locations.txt`

### Step 2: Create Query Database (`01_create_query_log_db.py`)
```bash
python 01_create_query_log_db.py
```
- Creates SQLite database for tracking queries
- Loads location data from Step 1
- Outputs: `{date}_{country_code}_query_log.db`
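
As a rough sketch of what such a query log could look like, the example below creates the `query` table documented under Database Schema and inserts one pending row per location and time window. The function name, the sample coordinates, and the in-memory database are assumptions for illustration; the actual script may populate the table differently.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS query (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    latitude TEXT,
    longitude TEXT,
    publishedAfter TEXT,
    publishedBefore TEXT,
    query_completed INTEGER DEFAULT 0
);
"""


def create_query_log(conn, locations, windows):
    """Insert one pending row per (location, time window) pair."""
    conn.executescript(SCHEMA)
    rows = [
        (str(lat), str(lon), after, before)
        for lat, lon in locations
        for after, before in windows
    ]
    conn.executemany(
        "INSERT INTO query (latitude, longitude, publishedAfter, publishedBefore) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # use a file path for a persistent log
locations = [(19.4326, -99.1332), (20.6597, -103.3496)]  # illustrative points
windows = [("2021-04-01T00:00:00Z", "2021-05-01T00:00:00Z")]
create_query_log(conn, locations, windows)
pending = conn.execute(
    "SELECT COUNT(*) FROM query WHERE query_completed = 0"
).fetchone()[0]
print(pending)  # 2
```

Tracking completion per row means an interrupted run can resume by selecting only rows where `query_completed = 0`.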

### Step 3: Search Videos (`02_search_videos.py`)
```bash
python 02_search_videos.py
```
- Performs geospatial YouTube searches using API
- Collects video metadata for each location/time period
- Outputs: `{date}_{country_code}_youtube_data.db`
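
For orientation, a geospatial search against the YouTube Data API v3 `search.list` endpoint combines a centre point, a radius, and a publication window. The sketch below only builds the request URL and sends nothing; `build_search_url` is a hypothetical helper, and the parameters the actual script sends may differ.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(api_key, latitude, longitude, radius,
                     published_after, published_before, page_token=None):
    """Build a search.list URL for videos near a point within a time window."""
    params = {
        "part": "snippet",
        "type": "video",  # restrict results to videos
        "location": f"{latitude},{longitude}",
        "locationRadius": radius,  # e.g. "7070m", as in config.yml
        "publishedAfter": published_after,
        "publishedBefore": published_before,
        "maxResults": 50,  # largest page size the endpoint accepts
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url(
    "YOUR_KEY", 19.4326, -99.1332, "7070m",
    "2021-04-01T00:00:00Z", "2021-05-01T00:00:00Z",
)
print(url)
```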

### Step 4: Process Video Data (`03_chunk_tables.py`)
```bash
python 03_chunk_tables.py
```
- Optimizes database structure for large datasets
- Creates indexed tables for efficient comment retrieval

### Step 5: Collect Comments (`04_get_video_comments.py`)
```bash
python 04_get_video_comments.py
```
- Retrieves all comments for collected videos
- Stores comment data with metadata
- Handles API rate limiting and pagination
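
Pagination in the YouTube Data API works by passing the `nextPageToken` from one response as the `pageToken` of the next request. The following sketch shows that loop with a stand-in fetch function so it runs offline; `iter_all_pages` and `FAKE_PAGES` are illustrative names, not code from `04_get_video_comments.py`.

```python
def iter_all_pages(fetch_page):
    """Yield items from every page, following nextPageToken until it is absent."""
    token = None
    while True:
        response = fetch_page(token)
        yield from response.get("items", [])
        token = response.get("nextPageToken")
        if not token:
            break


# Stand-in for a real API call so the sketch runs without network access.
FAKE_PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "p2"},
    "p2": {"items": ["c3"]},
}

comments = list(iter_all_pages(lambda token: FAKE_PAGES[token]))
print(comments)  # ['c1', 'c2', 'c3']
```

In a real run, `fetch_page` would wrap the HTTP request for one page of comments and return the decoded JSON response.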

### Automated Pipeline
Run the complete pipeline for multiple time periods:
```bash
python main.py
```

## Text Analysis

### WordNet-based Dictionary Creation
```bash
cd code/python/text-analysis
python 01_wordnet_dict_es.py
```
Creates weighted dictionaries for sentiment analysis using Spanish WordNet.

### Comment Analysis
```bash
python 02_word_frequency.py
python 03_scalar_sum.py
```
- Analyzes comment text for predefined concepts
- Generates scalar sentiment scores
- Outputs: CSV files with sentiment metrics
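
The exact scoring rule lives in `03_scalar_sum.py`. As a hedged illustration of the general idea, the sketch below assumes the score of a comment is the sum of dictionary weights over every matching token; the tokenizer and the weights are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and extract runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def scalar_score(text, weights):
    """Sum dictionary weights over every token occurrence in the comment."""
    counts = Counter(tokenize(text))
    return sum(weights[word] * n for word, n in counts.items() if word in weights)


# Hypothetical weights; the real dictionaries come from 01_wordnet_dict_es.py.
weights = {"violencia": 1.0, "miedo": 0.5}
score = scalar_score("Mucha violencia, mucho miedo. ¡Violencia!", weights)
print(score)  # 2.5
```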

### Visualization
```bash
python 00_visualise_wordnet.py
```
Creates network visualizations of WordNet relationships.

## Spatial Analysis

### R-based Spatial Analysis
```bash
cd code/R
Rscript main.R
```
- Performs country clustering analysis
- Creates comparative visualizations
- Generates spatial interpolation maps

### Inverse Distance Weighting
```bash
Rscript do-idw.R
```
Performs spatial interpolation of sentiment scores across geographic grids.
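
Inverse distance weighting estimates the value at an unsampled point as a weighted average of nearby observations, with weights proportional to `1 / distance ** power`. The package implements this in R (`do-idw.R`). The Python sketch below only illustrates the formula, assuming planar coordinates and a default power of 2, neither of which is confirmed by the R script.

```python
import math


def idw(target, samples, power=2.0):
    """Inverse-distance-weighted estimate at `target` from (x, y, value) samples."""
    numerator = 0.0
    denominator = 0.0
    for x, y, value in samples:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value  # target coincides with a sample point
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(idw((1.0, 0.0), samples))  # 15.0 (equidistant, so a plain average)
```

For latitude and longitude data, distances would normally be computed on projected coordinates or as great-circle distances rather than with the planar formula used here.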

## Correlation Analysis

The correlation analysis is implemented in the following Stata do-files (see `code/stata/`), which can be run to replicate it:

- `gid_analysis_baseline.do`
- `gid_analysis_appendix.do`
- `gid_maps.do`

## Directory Structure

```
youtube-api-replication-package/
├── code/
│   ├── python/
│   │   ├── youtube-api-scrape/     # Data collection scripts
│   │   ├── text-analysis/          # NLP and sentiment analysis
│   │   ├── utils/                  # Shared utilities
│   │   └── .env.example           # API key template
│   ├── stata/                     # Stata do-files
│   ├── R/                         # Spatial analysis scripts
│   └── tex/                       # LaTeX documents
├── config/
│   └── config.yml                 # Project configuration
├── data/                          # Input data (shapefiles, rasters)
├── output/                        # Results and visualizations
├── logs/                          # Processing logs
├── environment.yml                # Conda environment
└── README.md                      # This file
```

## Replication Guide

### Complete Replication for Mexico (2021-2022)
1. **Set up environment** (Steps 1-3 in Installation)
2. **Configure for Mexico**:
   ```yaml
   country_code: mx
   country_name: mexico
   start_datetime: '2021-04-01T00:00:00Z'
   end_datetime: '2022-03-01T23:59:59Z'
   ```
3. **Ensure data availability**:
   - Place Mexican shapefile in `data/mexico/mx-query-setup/state-shape/`
   - Place population raster in `data/mexico/mx-query-setup/pop-grid/`
4. **Run pipeline**:
   ```bash
   cd code/python/youtube-api-scrape
   python main.py
   ```
5. **Analyze results**:
   ```bash
   cd ../text-analysis
   python 01_wordnet_dict_es.py
   python 02_word_frequency.py
   python 03_scalar_sum.py
   ```

### Adapting for Other Countries
1. **Obtain geographic data**:
   - Country shapefile (administrative boundaries)
   - Population raster data (WorldPop or similar)
2. **Update configuration**:
   - Set appropriate `country_code` and `country_name`
   - Adjust `pop_threshold` and `radius` as needed
3. **Place data files** in `data/{country_name}/` following the directory structure
4. **Run pipeline** with new configuration

## Data Management

### File Naming Convention
- Location files: `{YYYYMM}_{YYYYMM}_{country_code}-locations.txt`
- Query databases: `{YYYYMM}_{YYYYMM}_01_{country_code}_query_log.db`
- Video databases: `{YYYYMM}_{YYYYMM}_02_{country_code}_youtube_data.db`
- Results: `{YYYYMM}_{YYYYMM}_05_{country_code}_comments_scalar_score.csv`
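
The convention above can be expressed as a small helper. This sketch assumes the two `{YYYYMM}` fields are the start and end months of the configured collection period, which the README does not state explicitly; `output_names` is a hypothetical function.

```python
from datetime import datetime


def period_prefix(start_datetime, end_datetime):
    """Return 'YYYYMM_YYYYMM' for the start and end of the collection period."""
    start = datetime.fromisoformat(start_datetime.replace("Z", "+00:00"))
    end = datetime.fromisoformat(end_datetime.replace("Z", "+00:00"))
    return f"{start:%Y%m}_{end:%Y%m}"


def output_names(start_datetime, end_datetime, country_code):
    """Build the file names listed in the naming convention."""
    prefix = period_prefix(start_datetime, end_datetime)
    return {
        "locations": f"{prefix}_{country_code}-locations.txt",
        "query_log": f"{prefix}_01_{country_code}_query_log.db",
        "videos": f"{prefix}_02_{country_code}_youtube_data.db",
        "scores": f"{prefix}_05_{country_code}_comments_scalar_score.csv",
    }


names = output_names("2021-04-01T00:00:00Z", "2022-03-01T23:59:59Z", "mx")
print(names["query_log"])  # 202104_202203_01_mx_query_log.db
```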

### Data Storage Guidelines
1. **Raw API data**: Never commit to version control
2. **Intermediate results**: Store in `output/{country_name}/`
3. **Final datasets**: Archive according to institutional data management policies
4. **Logs**: Maintain for troubleshooting and audit purposes

### Database Schema

#### Query Log Table
```sql
CREATE TABLE query (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    latitude TEXT,
    longitude TEXT,
    publishedAfter TEXT,
    publishedBefore TEXT,
    query_completed INTEGER DEFAULT 0
);
```

#### Video Metadata Table
```sql
CREATE TABLE videos (
    video_id TEXT PRIMARY KEY,
    publishedAt DATETIME,
    channelId TEXT,
    title TEXT,
    description TEXT,
    channelTitle TEXT,
    viewCount INTEGER,
    likeCount INTEGER,
    commentCount INTEGER
    -- Additional metadata fields omitted here
);
```

## Troubleshooting

### Common Issues

**API Rate Limiting**
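- The YouTube Data API enforces a daily quota (10,000 units per project by default, and each search request costs 100 units); monitor usage in the Google Cloud Console
- The scripts include automatic retry logic with exponential backoff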
- Consider running over multiple days for large datasets
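
For readers unfamiliar with the pattern, exponential backoff retries a failed call after a delay that doubles on each attempt. The sketch below is a generic illustration, not the retry code used in the scripts; `RetryableError` and `with_backoff` are hypothetical names, and the caller decides which failures count as retryable.

```python
import random
import time


class RetryableError(Exception):
    """Raised by the caller for failures worth retrying (e.g. transient server errors)."""


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying on RetryableError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Delay doubles each attempt; small jitter avoids synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Short retries help with transient failures only. Once the daily quota is exhausted, the practical remedy is to resume the run after the quota resets.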

**Memory Issues**
- Large databases may require chunking (handled by `03_chunk_tables.py`)
- Increase system memory or reduce batch sizes in scripts

**Missing Dependencies**
- Ensure all conda packages are installed: `conda env update -f environment.yml`
- For R dependencies: Install missing packages as indicated in error messages

**Geographic Data Issues**
- Verify coordinate reference systems match (EPSG:4326 recommended)
- Ensure population raster and shapefile cover the same geographic area

### Validation Steps
1. **Check data completeness**: Verify location file contains expected number of points
2. **Validate API responses**: Ensure video metadata fields are populated
3. **Verify geographic coverage**: Plot collected locations to confirm spatial distribution
4. **Test text analysis**: Run on small sample to verify sentiment scoring

### Performance Optimization
- Parallelize the R spatial analysis with the `future` package: `plan(multisession, workers = N)`
- Consider database indexing for large comment datasets
- Monitor disk space during collection (comments can be large)

## Citation

When using this replication package, please cite:

```
Amarasinghe, A., Nanlohy, S., Morgan, T., Hammond, D., Dahiya, Y., & Bailo, F. (2025, May 30). Mapping Violence Perceptions Through YouTube Comments: A New Approach to Real-Time Conflict Monitoring. https://doi.org/10.17605/OSF.IO/FA493
```


## Contact

For questions about replication or technical issues:
- Yashdeep.Dahiya@sydney.edu.au
- Francesco.Bailo@sydney.edu.au


